Machine learning is the study of how computers can adaptively improve their performance with experience accumulated from observed data. Our two sister courses teach the most fundamental algorithmic, theoretical, and practical tools that any user of machine learning needs to know. The first course of the two focuses more on mathematical tools, and the other course focuses more on algorithmic tools.
Machine Learning Foundations
Topic 1: When can machines learn?
Lecture 1: The Learning Problem
A takes D and H to get g
Course Introduction
foundation oriented and story-like
What is Machine Learning
use data to approximate target
Applications of Machine Learning
almost everywhere
Components of Machine Learning
A takes D and H to get g
Machine Learning and Other Fields
related to Data Mining, Artificial Intelligence, and Statistics
Lecture 2: Learning to Answer Yes/No
PLA A takes linear separable D and perceptrons H to get hypothesis g
Perceptron Hypothesis Set
hyperplanes/linear classifiers in Rd
Perceptron Learning Algorithm (PLA)
correct mistakes and improve iteratively
Guarantee of PLA
no mistake eventually if linear separable
Non-Separable Data
hold somewhat ‘best’ weights in pocket
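A minimal numpy sketch of the PLA/pocket story above (not the course's official code): the function name `pla_pocket`, the random mistake pick, and the `max_iter` cap are illustrative assumptions; `X` is assumed to carry a leading 1 for the bias and `y` to be in {-1, +1}.

```python
import numpy as np

def pla_pocket(X, y, max_iter=1000):
    """PLA with pocket: X has a leading 1-column (bias), y in {-1, +1}."""
    w = np.zeros(X.shape[1])            # current weights
    pocket_w = w.copy()                 # somewhat 'best' weights kept in the pocket
    pocket_err = np.mean(np.sign(X @ w) != y)
    for _ in range(max_iter):
        mistakes = np.where(np.sign(X @ w) != y)[0]
        if len(mistakes) == 0:          # no mistake: done (linearly separable case)
            return w
        n = np.random.choice(mistakes)
        w = w + y[n] * X[n]             # correct one mistake
        err = np.mean(np.sign(X @ w) != y)
        if err < pocket_err:            # keep the better weights in the pocket
            pocket_w, pocket_err = w.copy(), err
    return pocket_w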
Lecture 3: Types of Learning
focus: binary classification or regression from a batch of supervised data with concrete features
Learning with Different Output Space Y
[classification], [regression], structured
Learning with Different Data Label yn
[supervised], un/semi-supervised, reinforcement
Learning with Different Protocol f ⇒ (xn, yn)
[batch], online, active
Learning with Different Input Space X
[concrete], raw, abstract
Lecture 4: feasibility of learning
learning is PAC-possible if enough statistical data and finite |H|
Learning is Impossible?
absolutely no free lunch outside D
Probability to the Rescue
probably approximately correct outside D
Connection to Learning
verification possible if Ein(h) small for fixed h
Connection to Real Learning
learning possible if |H| finite and Ein(g) small
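Under the same assumptions as the lecture (finite hypothesis set of size M = |H|, N i.i.d. examples), the "probably approximately correct" argument is the Hoeffding inequality combined with a union bound over H:

```latex
\mathbb{P}\!\left[\,\exists\, h \in \mathcal{H}:\ |E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\right]
\;\le\; 2M \exp\!\left(-2\epsilon^{2} N\right)
```

So with finite |H| and large enough N, Ein(g) ≈ Eout(g) for whatever g the algorithm picks, and a small Ein(g) then implies learning.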
Topic 2: Why Can Machines Learn?
Lecture 5: training versus testing
effective price of choice in training: (wishfully) growth function mH(N) with a break point
Recap and Preview
two questions: Eout(g) ≈ Ein(g), and Ein(g) ≈ 0
Effective Number of Lines
at most 14 through the eye of 4 inputs
Effective Number of Hypotheses
at most mH(N) through the eye of N inputs
Break Point
when mH(N) becomes ‘non-exponential’
Lecture 6: theory of generalization
Eout ≈ Ein possible if mH(N) breaks somewhere and N large enough
Restriction of Break Point
break point ‘breaks’ consequent points
Bounding Function: Basic Cases
B(N,k) bounds mH(N) with break point k
Bounding Function: Inductive Cases
B(N,k) is poly(N)
A Pictorial Proof
mH(N) can replace M with a few changes
Lecture 7: the VC dimension
learning happens if finite dVC, large N, and low Ein
Definition of VC Dimension
maximum non-break point
VC Dimension of Perceptrons
dVC(H) = d + 1
Physical Intuition of VC Dimension
dVC ≈ #free parameters
Interpreting VC Dimension
loosely: model complexity & sample complexity
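The whole story of Lectures 5-7 compresses into the VC generalization bound (standard form used in the course): with probability at least 1 - δ,

```latex
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}},
\qquad
m_{\mathcal{H}}(N) \;\le\; \sum_{i=0}^{d_{\text{VC}}} \binom{N}{i} = O\!\left(N^{d_{\text{VC}}}\right).
```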
Lecture 8: noise and error
learning can happen with target distribution P(y|x) and low Ein w.r.t. err
Noise and Probabilistic Target
can replace f(x) by P(y|x)
Error Measure
affect ‘ideal’ target
Algorithmic Error Measure
user-dependent =⇒ plausible or friendly
Weighted Classification
easily done by virtual ‘example copying’
Topic 3: How Can Machines Learn?
Lecture 9: linear regression
analytic solution wLIN = X†y with linear regression hypotheses and squared error
Linear Regression Problem
use hyperplanes to approximate real values
Linear Regression Algorithm
analytic solution with pseudo-inverse
Generalization Issue
Eout − Ein ≈ 2(d+1)/N on average
Linear Regression for Binary Classification
0/1 error ≤ squared error
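A sketch of the analytic solution wLIN = X†y with numpy; `X` is assumed to already include the constant feature x0 = 1, and the function names are illustrative.

```python
import numpy as np

def linear_regression(X, y):
    """Analytic solution w_LIN = pseudo-inverse(X) @ y for squared error."""
    return np.linalg.pinv(X) @ y

def squared_error(w, X, y):
    """In-sample squared error E_in(w) of the returned hypothesis."""
    return np.mean((X @ w - y) ** 2)
```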
Lecture 10: logistic regression
gradient descent on cross-entropy error to get good logistic hypothesis
Logistic Regression Problem
P(+1|x) as target and θ(wT x) as hypotheses
Logistic Regression Error
cross-entropy (negative log likelihood)
Gradient of Logistic Regression Error
θ-weighted sum of data vectors
Gradient Descent
roll downhill by −∇Ein(w)
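A sketch of fixed-learning-rate gradient descent on the cross-entropy error; `eta` and `T` are assumed hyper-parameters, and `y` is in {-1, +1}.

```python
import numpy as np

def logistic(s):
    return 1.0 / (1.0 + np.exp(-s))

def logistic_regression_gd(X, y, eta=0.1, T=1000):
    """Gradient descent on cross-entropy (negative log likelihood) error."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        # gradient of E_in: theta(-y_n w^T x_n)-weighted sum of (-y_n x_n)
        grad = np.mean(logistic(-y * (X @ w))[:, None] * (-y[:, None] * X), axis=0)
        w = w - eta * grad              # roll downhill by -grad
    return w
```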
Lecture 11: linear models for classification
binary classification via (logistic) regression; multiclass via OVA/OVO decomposition
Linear Models for Binary Classification
three models useful in different ways
Stochastic Gradient Descent
follow negative stochastic gradient
Multiclass via Logistic Regression
predict with maximum estimated P(k|x)
Multiclass via Binary Classification
predict the tournament champion
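A sketch of the stochastic gradient descent item above for logistic regression, following the negative stochastic gradient computed on one randomly picked example per step; `eta`, `T`, and `seed` are assumed hyper-parameters.

```python
import numpy as np

def logistic_regression_sgd(X, y, eta=0.1, T=10000, seed=0):
    """SGD on cross-entropy error with one example per update; y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        n = rng.integers(len(y))                               # pick one example
        s = -y[n] * (X[n] @ w)
        grad_n = (1.0 / (1.0 + np.exp(-s))) * (-y[n] * X[n])   # stochastic gradient
        w = w - eta * grad_n                                   # follow its negative
    return w
```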
Lecture 12: nonlinear transformation
nonlinear X via nonlinear feature transform Φ plus linear X with price of model complexity
Quadratic Hypotheses
linear hypotheses on quadratic-transformed data
Nonlinear Transform
happy linear modeling after Z = Φ(X )
Price of Nonlinear Transform
computation/storage/[model complexity]
Structured Hypothesis Sets
linear/simpler model first
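A sketch of the quadratic feature transform Φ2 behind "linear hypotheses on quadratic-transformed data"; note the price: the transformed dimension d̃ grows as O(d²).

```python
import numpy as np

def quadratic_transform(X):
    """Phi_2: prepend the constant 1, keep x_1..x_d, and add all x_i * x_j (i <= j).
    A linear model trained on Z = Phi_2(X) is a quadratic model on X."""
    n, d = X.shape
    quad = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.column_stack([np.ones(n), X] + quad)
```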
Topic 4: How Can Machines Learn Better?
Lecture 13: hazard of overfitting
overfitting happens with excessive power, stochastic/deterministic noise, and limited data
What is Overfitting?
lower Ein but higher Eout
The Role of Noise and Data Size
overfitting ‘easily’ happens!
Deterministic Noise
what H cannot capture acts like noise
Dealing with Overfitting
data cleaning/pruning/hinting, and more
Lecture 14: regularization
minimizes augmented error, where the added regularizer effectively limits model complexity
Regularized Hypothesis Set
original H + constraint
Weight Decay Regularization
add (λ/N) wTw to Eaug
Regularization and VC Theory
regularization decreases dEFF
General Regularizers
target-dependent, [plausible], or [friendly]
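Weight decay combined with linear regression has a closed form (ridge regression); a sketch assuming `Z` is the (possibly transformed) data matrix:

```python
import numpy as np

def ridge_regression(Z, y, lam=1.0):
    """w_REG = argmin (1/N)||Zw - y||^2 + (lambda/N) w^T w
             = (Z^T Z + lambda I)^{-1} Z^T y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```

Setting `lam=0` recovers plain linear regression; a larger `lam` acts as a stronger brake on the effective model complexity.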
Lecture 15: validation
(crossly) reserve validation data to simulate testing procedure for model selection
Model Selection Problem
dangerous by Ein and dishonest by Etest
Validation
select with Eval(Am(Dtrain)) while returning Am∗(D)
Leave-One-Out Cross Validation
huge computation for almost unbiased estimate
V -Fold Cross Validation
reasonable computation and performance
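A sketch of V-fold cross-validation; `train` and `error` are assumed placeholders for a learning algorithm Am and its error measure (hypothetical interfaces for illustration).

```python
import numpy as np

def v_fold_cv_error(X, y, train, error, V=10):
    """Estimate E_cv by averaging validation errors over V folds."""
    N = len(y)
    folds = np.array_split(np.random.permutation(N), V)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        g_minus = train(X[train_idx], y[train_idx])      # train on D \ D_val
        errs.append(error(g_minus, X[val_idx], y[val_idx]))
    return np.mean(errs)

# model selection: pick the m* with the smallest E_cv,
# then re-run A_{m*} on the full data D before returning its output.
```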
Lecture 16: three learning principles
Occam’s Razor
simple, simple, simple!
Sampling Bias
match test scenario as much as possible
Data Snooping
any use of data is ‘contamination’
Power of Three
relatives, bounds, models, tools, principles
Machine Learning Techniques
Topic 1: Embedding Numerous Features: Kernel Models
Lecture 1: Linear Support Vector Machine
linear SVM: more robust and solvable with quadratic programming
Course Introduction
from foundations to techniques
Large-Margin Separating Hyperplane
intuitively more robust against noise
Standard Large-Margin Problem
minimize ‘length of w’ at special separating scale
Support Vector Machine
‘easy’ via quadratic programming
Reasons behind Large-Margin Hyperplane
fewer dichotomies and better generalization
Lecture 2: Dual Support Vector Machine
dual SVM: another QP with valuable geometric messages and almost no dependence on d̃
Motivation of Dual SVM
want to remove dependence on d̃
Lagrange Dual SVM
KKT conditions link primal/dual
Solving Dual SVM
another QP, better solved with special solver
Messages behind Dual SVM
SVs represent fattest hyperplane
Lecture 3: Kernel Support Vector Machine
kernel as a shortcut to (transform + inner product) to remove dependence on d̃: allowing a spectrum from simple (linear) models to infinite-dimensional (Gaussian) ones with margin control
Kernel Trick
kernel as shortcut of transform + inner product
Polynomial Kernel
embeds specially-scaled polynomial transform
Gaussian Kernel
embeds infinite dimensional transform
Comparison of Kernels
linear for efficiency or Gaussian for power
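A sketch of the two kernels compared above; computing K(x, x') directly is the shortcut that avoids the explicit d̃-dimensional transform. The parameter names (`zeta`, `gamma`, `Q`) follow the lecture's notation.

```python
import numpy as np

def polynomial_kernel(X1, X2, zeta=1.0, gamma=1.0, Q=2):
    """K(x, x') = (zeta + gamma x^T x')^Q -- specially-scaled polynomial transform."""
    return (zeta + gamma * (X1 @ X2.T)) ** Q

def gaussian_kernel(X1, X2, gamma=1.0):
    """K(x, x') = exp(-gamma ||x - x'||^2) -- embeds an infinite-dimensional transform."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)
```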
Lecture 4: Soft-Margin Support Vector Machine
allow some margin violations ξn while penalizing them by C; equivalent to upper-bounding αn by C
Motivation and Primal Problem
add margin violations ξn
Dual Problem
upper-bound αn by C
Messages behind Soft-Margin SVM
bounded/free SVs for data analysis
Model Selection
cross-validation, or approximately nSV
Lecture 5: Kernel Logistic Regression
two-level learning for SVM-like sparse model for soft classification, or using representer theorem with regularized logistic error for dense model
Soft-Margin SVM as Regularized Model
L2-regularization with hinge error measure
SVM versus Logistic Regression
≈ L2-regularized logistic regression
SVM for Soft Binary Classification
common approach: two-level learning
Kernel Logistic Regression
representer theorem on L2-regularized LogReg
Lecture 6: Support Vector Regression
kernel ridge regression (dense) via ridge regression + representer theorem; support vector regression (sparse) via regularized tube error + Lagrange dual
Kernel Ridge Regression
representer theorem on ridge regression
Support Vector Regression Primal
minimize regularized tube errors
Support Vector Regression Dual
a QP similar to SVM dual
Summary of Kernel Models
with great power comes great responsibility
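For the dense model of Lecture 6, kernel ridge regression has a closed form via the representer theorem; a sketch where `K` is the kernel matrix on the training data (built, e.g., with the `gaussian_kernel` sketch above).

```python
import numpy as np

def kernel_ridge_regression(K, y, lam=1.0):
    """beta = (lambda I + K)^{-1} y; the hypothesis is g(x) = sum_n beta_n K(x_n, x)."""
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)
```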
Topic 2: Combining Predictive Features: Aggregation Models
Lecture 7: Blending and Bagging
blending known diverse hypotheses uniformly, linearly, or even non-linearly; obtaining diverse hypotheses from bootstrapped data
Motivation of Aggregation
aggregated G strong and/or moderate
Uniform Blending
diverse hypotheses, ‘one vote, one value’
Linear and Any Blending
two-level learning with hypotheses as transform
Bagging (Bootstrap Aggregation)
bootstrapping for diverse hypotheses
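A sketch of bagging with uniform voting; `base_train` is an assumed placeholder that returns a callable hypothesis g with g(X) in {-1, +1}, and T, the number of bootstrap rounds, is illustrative.

```python
import numpy as np

def bagging(X, y, base_train, T=25, seed=0):
    """Bootstrap T datasets of size N and train one diverse hypothesis on each."""
    rng = np.random.default_rng(seed)
    N = len(y)
    hypotheses = []
    for _ in range(T):
        idx = rng.integers(N, size=N)          # sample N examples with replacement
        hypotheses.append(base_train(X[idx], y[idx]))
    return hypotheses

def uniform_vote(hypotheses, X):
    """G(x) = sign of the uniform vote of all g_t(x): 'one vote, one value'."""
    votes = np.sum([g(X) for g in hypotheses], axis=0)
    return np.sign(votes)
```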
Lecture 8: Adaptive Boosting
optimal re-weighting for diverse hypotheses and adaptive linear aggregation to boost ‘weak’ algorithms
Motivation of Boosting
aggregate weak hypotheses for strength
Diversity by Re-weighting
scale up incorrect, scale down correct
Adaptive Boosting Algorithm
two heads are better than one, theoretically
Adaptive Boosting in Action
AdaBoost-Stump useful and efficient
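A sketch of AdaBoost-Stump: decision stumps as the 'weak' base learner, re-weighting by the scaling factor sqrt((1 - εt)/εt), and linear aggregation with αt = ln(scaling factor), following the lecture; the function names and the exhaustive stump search are illustrative.

```python
import numpy as np

def stump_predict(X, i, theta, s):
    """Decision stump h(x) = s * sign(x_i - theta), treating the boundary as +1."""
    return s * np.where(X[:, i] >= theta, 1.0, -1.0)

def best_stump(X, y, u):
    """Pick (feature, threshold, direction) minimizing the u-weighted 0/1 error."""
    best = (None, None, None, np.inf)
    for i in range(X.shape[1]):
        for theta in np.concatenate(([-np.inf], np.unique(X[:, i]))):
            for s in (+1.0, -1.0):
                err = np.sum(u * (stump_predict(X, i, theta, s) != y))
                if err < best[3]:
                    best = (i, theta, s, err)
    return best

def adaboost_stump(X, y, T=30):
    u = np.full(len(y), 1.0 / len(y))               # example weights
    stumps, alphas = [], []
    for _ in range(T):
        i, theta, s, _ = best_stump(X, y, u)
        pred = stump_predict(X, i, theta, s)
        eps = np.sum(u * (pred != y)) / np.sum(u)   # weighted error rate
        scale = np.sqrt((1 - eps) / max(eps, 1e-12))  # guard against eps = 0
        u = np.where(pred != y, u * scale, u / scale)  # scale up incorrect, down correct
        stumps.append((i, theta, s))
        alphas.append(np.log(scale))                # larger vote for stronger stumps
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    score = sum(a * stump_predict(X, i, th, s) for (i, th, s), a in zip(stumps, alphas))
    return np.sign(score)
```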
Lecture 9: Decision Tree
recursive branching (purification) for conditional aggregation of constant hypotheses
Decision Tree Hypothesis
express path-conditional aggregation
Decision Tree Algorithm
recursive branching until termination to base
Decision Tree Heuristics in C&RT
pruning, categorical branching, surrogate
Decision Tree in Action
explainable and efficient
Lecture 10: Random Forest
bagging of randomized C&RT trees with automatic validation and feature selection
Random Forest Algorithm
bag of trees on randomly projected subspaces
Out-Of-Bag Estimate
self-validation with OOB examples
Feature Selection
permutation test for feature importance
Random Forest in Action
‘smooth’ boundary with many trees
Lecture 11: Gradient Boosted Decision Tree
aggregating trees from functional gradient and steepest descent subject to any error measure
Adaptive Boosted Decision Tree
sampling and pruning for ‘weak’ trees
Optimization View of AdaBoost
functional gradient descent on exponential error
Gradient Boosting
iterative steepest residual fitting
Summary of Aggregation Models
some cure underfitting; some cure overfitting
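A sketch of gradient boosting for regression as "iterative steepest residual fitting"; `base_train` is an assumed placeholder for a weak regressor (e.g. a small C&RT tree), and the fixed shrinkage `eta` stands in for the per-round steepest-descent step size from the lecture.

```python
import numpy as np

def gradient_boost_regression(X, y, base_train, T=100, eta=0.1):
    """Each round fits a 'weak' regressor g_t to the current residuals y - F(x)
    (the negative functional gradient of squared error), then adds eta * g_t to F."""
    preds = np.zeros(len(y))
    ensemble = []
    for _ in range(T):
        residual = y - preds                # residuals = negative functional gradient
        g_t = base_train(X, residual)       # g_t is a callable returned by the base learner
        preds = preds + eta * g_t(X)        # steepest-descent-style update of F
        ensemble.append(g_t)
    return lambda X_new: eta * sum(g(X_new) for g in ensemble)
```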
Topic 3: Distilling Implicit Features: Extraction Models
Lecture 12: Neural Network
automatic pattern feature extraction from layers of neurons with backprop for GD/SGD
Motivation
multi-layer for power with biological inspirations
Neural Network Hypothesis
layered pattern extraction until linear hypothesis
Neural Network Learning
backprop to compute gradient efficiently
Optimization and Regularization
tricks on initialization, regularizer, early stopping
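A sketch of one backprop + gradient descent step for a single-hidden-layer tanh network with squared error; the shapes, weight matrices `W1`/`W2`, and `eta` are illustrative assumptions, not the course's code.

```python
import numpy as np

def nnet_gd_step(X, y, W1, W2, eta=0.1):
    """One GD step for a 1-hidden-layer tanh network with squared error.
    X: (N, d+1) with leading 1s; W1: (d+1, M); W2: (M+1, 1); y: (N, 1)."""
    # forward pass: layered pattern extraction ending in a linear output hypothesis
    S1 = X @ W1
    A1 = np.hstack([np.ones((len(X), 1)), np.tanh(S1)])   # add the bias unit
    out = A1 @ W2
    # backward pass: propagate 'sensitivities' to get the gradient efficiently
    delta2 = 2 * (out - y) / len(X)                        # d(mean squared error)/d(out)
    grad_W2 = A1.T @ delta2
    delta1 = (delta2 @ W2[1:].T) * (1 - np.tanh(S1) ** 2)  # skip the bias weight
    grad_W1 = X.T @ delta1
    return W1 - eta * grad_W1, W2 - eta * grad_W2
```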
Lecture 13: Deep Learning
pre-training with denoising autoencoder (non-linear PCA) and fine-tuning with backprop for NNet with many layers
Deep Neural Network
difficult hierarchical feature extraction problem
Autoencoder
unsupervised NNet learning of representation
Denoising Autoencoder
using noise as hints for regularization
Principal Component Analysis
linear autoencoder variant for data processing
Lecture 14: Radial Basis Function Network
linear aggregation of distance-based similarities using k-Means clustering for prototype finding
RBF Network Hypothesis
prototypes instead of neurons as transform
RBF Network Learning
linear aggregation of prototype ‘hypotheses’
k-Means Algorithm
clustering with alternating optimization
k-Means and RBF Network in Action
proper choice of # prototypes important
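A sketch of k-Means by alternating optimization, as used for prototype (center) finding in the RBF Network; `k`, `iters`, and `seed` are assumed parameters.

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Alternate: fix centers -> assign each point to its nearest center;
    fix assignments -> recompute each center as the cluster mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # init with k prototypes
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = np.argmin(dists, axis=1)                    # optimal assignment step
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, assign
```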
Lecture 15: Matrix Factorization
linear models of movies on extracted user features (or vice versa) jointly optimized with stochastic gradient descent
Linear Network Hypothesis
feature extraction from binary vector encoding
Basic Matrix Factorization
alternating least squares between user/movie
Stochastic Gradient Descent
efficient and easily modified for practical use
Summary of Extraction Models
powerful thus need careful use
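A sketch of matrix factorization by SGD; `ratings` is assumed to be an iterable of (user index, movie index, rating) triples, and the per-example update follows the residual-times-feature gradient of squared error.

```python
import numpy as np

def matrix_factorization_sgd(ratings, n_users, n_movies, d=10,
                             eta=0.01, T=20, seed=0):
    """Learn user features V and movie features W so that rating ~ v_u^T w_m."""
    rng = np.random.default_rng(seed)
    V = 0.1 * rng.standard_normal((n_users, d))
    W = 0.1 * rng.standard_normal((n_movies, d))
    for _ in range(T):                               # T passes over the ratings
        for u, m, r in rng.permutation(np.asarray(ratings, dtype=float)):
            u, m = int(u), int(m)
            residual = r - V[u] @ W[m]               # per-example residual
            grad_v = residual * W[m]
            grad_w = residual * V[u]
            V[u] += eta * grad_v                     # SGD step on user features
            W[m] += eta * grad_w                     # SGD step on movie features
    return V, W
```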
Lecture 16: Finale
Feature Exploitation Techniques
kernel, aggregation, extraction, low-dimensional
Error Optimization Techniques
gradient, equivalence, stages
Overfitting Elimination Techniques
(lots of) regularization, validation
Machine Learning in Practice
welcome to the jungle
Notes
Terms
- regularization; ridge regression
- pseudo-inverse matrix
- discrete function
- Lagrange dual method
- Lagrange minimization
- Lagrange multiplier method
- space transformation (feature transform)
- primal SVM
- dual SVM
- Decision Stump
L1-5 25
L2-19
L4-18 19
L5-4 5 9 15 17
L7 From the two examples above, we see two advantages of aggregation: feature transform and regularization. As introduced in the Machine Learning Foundations course, feature transform and regularization pull in opposite directions, like the gas pedal and the brake: if aggressive feature transform is performed, regularization usually works poorly, and vice versa. A single model therefore typically leans toward either feature transform or regularization and must trade one off against the other. Aggregation, however, can combine the strengths of both, like controlling the gas and the brake well at the same time, and thus yields a good predictive model.
L07-20 bootstrap aggregation (BAGging): a simple meta algorithm on top of base algorithm A
L07-21 Note that bagging performs well only when the base algorithm is fairly sensitive to the distribution of the data samples.
L08-20
L09-10
L10-2 Bagging can reduce variance, while a Decision Tree can increase variance.
L11-10
L12-6 In the perceptron model, the input of each node corresponds to a neuron's dendrites, and the output of each node corresponds to the neuron's axon.
L12-13 16GD 21
L14-5 9 10